
MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support #4735

Open

arcivanov wants to merge 29 commits into MariaDB:10.11 from arcivanov:MDEV-38975

Conversation

@arcivanov
Contributor

@arcivanov arcivanov commented Mar 5, 2026

Allow BLOB/TEXT/JSON/GEOMETRY columns in MEMORY (HEAP) engine
tables by storing blob data in variable-length continuation record chains
within the existing HP_BLOCK structure.

Additionally, promote VARCHAR fields whose octet_length > 32 bytes to
BLOB when the temp table uses the HEAP engine, storing only actual data
in continuation chains instead of reserving the full declared width in
every fixed-width row. This is the same promotion mechanism that
type_handler_for_tmp_table() applies at CONVERT_IF_BIGGER_TO_BLOB
(512 characters), just at a lower byte-based threshold for HEAP.

HASH index key handling for BLOB columns: the HEAP hash functions
(hp_rec_hashnr(), hp_rec_key_cmp(), hp_key_cmp(), hp_make_key(),
hp_hashnr()) are extended to handle HA_BLOB_PART key segments,
reading actual blob data via pointer dereference or chain materialization.
This enables GROUP BY, DISTINCT, and UNION deduplication to work
correctly on BLOB/TEXT columns entirely within the MEMORY engine. A hash
pre-check optimization skips expensive blob materialization when hash
values differ, and PAD SPACE collation semantics are preserved for blob
key comparisons.
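The hash pre-check described above can be sketched as follows. This is an illustrative model, not the actual HEAP API: `blob_row`, `rows_equal()`, and the stored-hash field are stand-ins for the real record layout and `hp_rec_key_cmp()` path.

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: candidate rows carry a precomputed hash, so the
   expensive blob materialization + memcmp only runs when hashes match. */
typedef struct {
  uint32_t hash;        /* hash value stored with the row */
  const char *data;     /* materialized blob data (after chain reassembly) */
  size_t length;
} blob_row;

/* Returns 1 if the rows compare equal; rejects cheaply when hashes differ,
   skipping materialization entirely. */
static int rows_equal(const blob_row *a, const blob_row *b)
{
  if (a->hash != b->hash)
    return 0;                           /* cheap rejection, no blob read */
  if (a->length != b->length)
    return 0;
  return memcmp(a->data, b->data, a->length) == 0;
}
```

Equal hashes do not imply equal rows, so the full compare still runs on a hash match; the win is only on the (common) mismatch path.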

Continuation runs: blob data is split across contiguous sequences
of recbuffer-sized records. Multi-record runs store a 10-byte header
(next_cont pointer + run_rec_count) in the first record; inner
records (rec 1..N-1) have no flags byte — full recbuffer payload.
Runs are linked via next_cont pointers. Individual runs are capped
at 65,535 records (uint16 format limit); larger blobs are
automatically split into multiple runs.
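The 10-byte run header above can be sketched as a pack/unpack pair. Field order and helper names here are illustrative, not the real HEAP layout; the `uint16` count field is what imposes the 65,535-record cap per run.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the run header: 8-byte next_cont pointer + 2-byte
   run_rec_count, stored at the start of the first record of a run. */
enum { CONT_HEADER_SIZE = 10 };

static void write_run_header(uint8_t *rec, void *next_cont,
                             uint16_t run_rec_count)
{
  memcpy(rec, &next_cont, sizeof(void *));           /* bytes 0..7 */
  memcpy(rec + sizeof(void *), &run_rec_count, 2);   /* bytes 8..9 */
}

static void read_run_header(const uint8_t *rec, void **next_cont,
                            uint16_t *run_rec_count)
{
  memcpy(next_cont, rec, sizeof(void *));
  memcpy(run_rec_count, rec + sizeof(void *), 2);
}
```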

Blob storage format (enum hp_blob_format, decoded by
hp_blob_run_format() from the flags byte — single detection
mechanism used by all write, read, free, scan, and check paths):

  • Case A (HP_BLOB_CASE_A_SINGLE_REC): blob fits in one record.
    No run header — data starts at offset 0, full visible bytes
    available. Saves 10 bytes per small blob. Zero-copy: chain.
  • Case B (HP_BLOB_CASE_B_ZEROCOPY): single run, multiple records.
    Header in rec 0, data contiguous in rec 1..N-1.
    Zero-copy: chain + recbuffer.
  • Case C (HP_BLOB_CASE_C_MULTI_RUN): one or more runs linked via
    next_cont. Header + data in each run. Requires reassembly into
    blob_buff.
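The three-way classification above can be sketched from the flags byte. The bit values here are invented for illustration; the real detection lives in `hp_blob_run_format()`.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the storage-format decision. Flag bit positions are
   hypothetical, not the actual HEAP flags-byte encoding. */
enum hp_blob_format_sketch {
  CASE_A_SINGLE_REC,   /* fits in one record, no run header */
  CASE_B_ZEROCOPY,     /* single run, multiple records, contiguous data */
  CASE_C_MULTI_RUN     /* one or more runs linked via next_cont */
};

#define FLG_MULTIPLE_REC 0x10   /* hypothetical bit values */
#define FLG_ZEROCOPY     0x08

static enum hp_blob_format_sketch classify(uint8_t flags)
{
  if (!(flags & FLG_MULTIPLE_REC))
    return CASE_A_SINGLE_REC;        /* data starts at offset 0 */
  if (flags & FLG_ZEROCOPY)
    return CASE_B_ZEROCOPY;          /* header in rec 0, data in rec 1..N-1 */
  return CASE_C_MULTI_RUN;           /* requires reassembly into blob_buff */
}
```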

HP_INFO::has_zerocopy_blobs tracks zero-copy state; used by
heap_update() to refresh the caller's record buffer after freeing
old chains, preventing dangling pointers. Under HEAP's current
table-level locking (thr_lock), other handles are blocked during
UPDATE, so cross-handle dangling pointers cannot occur. For internal
temporary tables (single handle, single thread), this is definitively
safe.

Unchanged blob optimization: heap_update() compares each blob
column's old and new values before rewriting chains. Detection order:
length (O(1)), data pointer (O(1)), memcmp (O(n) with early exit).
Unchanged blobs keep existing chains with no allocation, copy, or free.
Only changed blobs get new chains written.
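The three-step detection order above can be sketched directly; the function name is illustrative.

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Sketch of unchanged-blob detection: length check, pointer identity,
   then memcmp with early exit, in increasing cost order. */
static int blob_unchanged(const char *old_data, size_t old_len,
                          const char *new_data, size_t new_len)
{
  if (old_len != new_len)
    return 0;                     /* O(1): lengths differ -> changed */
  if (old_data == new_data)
    return 1;                     /* O(1): same buffer -> unchanged */
  return memcmp(old_data, new_data, old_len) == 0;  /* O(n), early exit */
}
```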

Free list scavenging: on insert, the free list is walked read-only
(peek) tracking contiguous groups in descending address order (LIFO).
Qualifying groups (>= min_run_records) are unlinked and used. The
first non-qualifying group terminates the scan — remaining data is
allocated from the block tail. The free list is never disturbed when
no qualifying group is found.
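The read-only peek walk above can be sketched as follows. Because the free list is LIFO with descending addresses, a contiguous group shows up as successive nodes exactly one `recbuffer` apart. The node type and function are simplified stand-ins (the real list holds deleted HEAP records), and only the first group is examined here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct free_node { struct free_node *next; };

/* Peek at the first contiguous group on the free list without
   modifying it. Returns the group head if it has at least
   min_run_records nodes (and stores the size in *group_len);
   returns NULL otherwise, terminating the scan. */
static struct free_node *peek_first_group(struct free_node *head,
                                          size_t recbuffer,
                                          unsigned min_run_records,
                                          unsigned *group_len)
{
  if (!head)
    return NULL;
  unsigned count = 1;
  for (struct free_node *p = head;
       p->next && (uint8_t *) p->next == (uint8_t *) p - recbuffer;
       p = p->next)
    count++;                         /* extend the contiguous group */
  if (count < min_run_records)
    return NULL;   /* first non-qualifying group terminates the scan */
  *group_len = count;
  return head;
}
```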

Record counting: new HP_SHARE::total_records tracks all physical
records (primary + continuation). HP_SHARE::records remains logical
(primary-only) to preserve linear hash bucket mapping correctness.

Scan/check batch-skip: heap_scan() and heap_check_heap() use
hp_blob_run_format() to detect the storage format. Case A skips one
record; Case B/C read run_rec_count from the header and skip the
entire run at once.

Blob key format bridge: Field_blob::new_key_field() returns
Field_varstring (2B length + inline data), but HEAP expects
hp_make_key format (4B length + data pointer). Precomputed
HP_KEYDEF::has_blob_seg flag triggers key rebuild from record[0]
via hp_make_key() in ha_heap::index_read_map() and related methods.
GROUP BY key setup sets key_part_flag from field->key_part_flag()
(not hardcoded 0) so HA_BLOB_PART reaches HEAP's blob code paths.
For promoted blob columns where Field_blob::key_length() returns 0,
the GROUP BY key field uses the item's max_length instead, matching
the buffer size allocated by calc_group_buffer().

SQL layer: choose_engine() no longer rejects HEAP for blob tables
(replaced blob_fields check with reclength > HA_MAX_REC_LENGTH).
remove_duplicates() routes HEAP+blob to remove_dup_with_compare().
ha_heap::remember_rnd_pos() / restart_rnd_next() implemented for
DISTINCT deduplication support. Fixed undefined behavior in
test_if_cheaper_ordering() where select_limit/fanout could overflow
to infinity — capped at HA_POS_ERROR.
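The overflow cap above can be sketched as follows: dividing a large row estimate by a tiny fanout can exceed the `double` range and become `+inf`, and converting that back to an integer is undefined behavior. The constant here mirrors `HA_POS_ERROR` (`~0ULL`); the real value lives in `my_base.h`.

```c
#include <assert.h>
#include <math.h>

#define HA_POS_ERROR_SKETCH (~(unsigned long long) 0)

/* Cap a double row estimate before converting to an integer count.
   The inverted comparison catches inf and NaN as well as overflow. */
static unsigned long long capped_rows(double select_limit, double fanout)
{
  double rows = select_limit / fanout;
  if (!(rows < (double) HA_POS_ERROR_SKETCH))
    return HA_POS_ERROR_SKETCH;
  return (unsigned long long) rows;
}
```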

VARCHAR→BLOB promotion: extracted pick_engine() from
choose_engine(), called early in start() to set m_heap_expected.
VARCHAR fields with octet_length > HEAP_CONVERT_IF_BIGGER_TO_BLOB
(32 bytes) are promoted to BLOB via Tmp_field_param::varchar_to_blob_threshold.
For I_S tables, the same promotion is applied in add_schema_fields()
using the field definition's octet length.
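The byte-based promotion test above can be sketched as follows. The constant name mirrors the description; the decision function is illustrative, not the actual `Tmp_field_param` logic.

```c
#include <assert.h>

#define HEAP_CONVERT_IF_BIGGER_TO_BLOB 32  /* bytes, per this patch */

/* A VARCHAR is promoted to BLOB for HEAP tmp tables when its maximum
   octet length (chars * max bytes per char) exceeds the threshold,
   so the fixed-width row stores a small descriptor instead of the
   full declared width. */
static int promote_varchar_to_blob(unsigned chars, unsigned bytes_per_char)
{
  unsigned octet_length = chars * bytes_per_char;
  return octet_length > HEAP_CONVERT_IF_BIGGER_TO_BLOB;
}
```

Note the threshold is in octets, so a VARCHAR(16) in utf8mb4 (64 octets) is promoted while a VARCHAR(32) in latin1 (32 octets) is not.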

https://jira.mariadb.org/browse/MDEV-38975

@arcivanov
Contributor Author

@gkodinov @dr-m

@grooverdan
Member

grooverdan commented Mar 5, 2026

@arcivanov , a truly impressive first PR.

See https://mariadb.com/docs/server/server-management/install-and-upgrade-mariadb/installing-mariadb/compiling-mariadb-from-source/compile-and-using-mariadb-with-sanitizers-asan-ubsan-tsan-msan#buildbots-msan-container for help resolving the MSAN errors. A dev_debian13-msan-clang-22 tagged MSAN container was built yesterday.

Full ci results - https://buildbot.mariadb.org/#/grid?branch=refs%2Fpull%2F4735%2Fhead

Do reach out if you need help - https://mariadb.zulipchat.com

@arcivanov arcivanov force-pushed the MDEV-38975 branch 2 times, most recently from ced75a6 to 57fc52e Compare March 5, 2026 07:59
@gkodinov gkodinov added the External Contribution label Mar 5, 2026
Member

@gkodinov gkodinov left a comment


Thank you for your contribution! This is a preliminary review.

There's nothing formally wrong with the diff. One remark: this seems like a new feature to me. So, technically, it should be going to main according to policy.

But I'll leave that decision to the final reviewer.

LGTM. Please stay tuned for the final review.

Member

@gkodinov gkodinov left a comment


Missed the tests failing. Please make sure buildbot doesn't produce any failures.

@arcivanov
Contributor Author

The tests are being fixed right now.

@arcivanov arcivanov force-pushed the MDEV-38975 branch 5 times, most recently from 08ec2cf to 7fbaf3e Compare March 5, 2026 11:32
Member

@gkodinov gkodinov left a comment


Nothing more to add. Please stay tuned for the final review.

@gkodinov gkodinov requested a review from montywi March 5, 2026 12:07
@montywi
Contributor

montywi commented Mar 5, 2026

I have added a lot of comments about the design in https://jira.mariadb.org/browse/MDEV-38975
(looks good)

@arcivanov arcivanov force-pushed the MDEV-38975 branch 6 times, most recently from f1dc279 to 15023ad Compare March 6, 2026 05:52
@arcivanov
Contributor Author

The https://buildbot.mariadb.org/#builders/534/builds/35108 failing test is a flaky timeout issue.

This is rpl.rpl_semi_sync_shutdown_await_ack - a replication semi-sync test, nothing to do with HEAP/BLOB. It's a timeout waiting for a semi-sync ACK during shutdown, failing across all three binlog formats (stmt, row, mix). This is a flaky replication test, not caused by this feature.

Not being in MariaDB I can't press rebuild to clear it so if anybody would be so kind?

@arcivanov arcivanov changed the title MDEV-38975: HEAP engine BLOB/TEXT column support MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support Mar 6, 2026
@arcivanov
Contributor Author


Nevermind, I pushed a commit update, hopefully next run is more lucky

@grooverdan
Member

> The https://buildbot.mariadb.org/#builders/534/builds/35108 failing test is a flaky timeout issue.
> This is rpl.rpl_semi_sync_shutdown_await_ack - a replication semi-sync test, nothing to do with HEAP/BLOB. It's a timeout waiting for a semi-sync ACK during shutdown, failing across all three binlog formats (stmt, row, mix). This is a flaky replication test, not caused by this feature.
> Not being in MariaDB I can't press rebuild to clear it so if anybody would be so kind?
>
> Nevermind, I pushed a commit update, hopefully next run is more lucky

Those tests do sound like usual suspects. Don't worry, review isn't going to stop because of a few tests and we'll retrigger if important.

MSAN happy after your first update and still is.

Thanks.

@arcivanov
Contributor Author

This PR is finalized pending review. I don't expect any more architectural changes or refactorings unless requested.

Allow BLOB/TEXT/JSON/GEOMETRY columns in MEMORY (HEAP) engine tables
by storing blob data in variable-length continuation record chains
within the existing `HP_BLOCK` structure.

**Continuation runs**: blob data is split across contiguous sequences
of `recbuffer`-sized records. Each run stores a 10-byte header
(`next_cont` pointer + `run_rec_count`) in the first record; inner
records (rec 1..N-1) have no flags byte — full `recbuffer` payload.
Runs are linked via `next_cont` pointers. Individual runs are capped
at 65,535 records (`uint16` format limit); larger blobs are
automatically split into multiple runs.

**Zero-copy reads**: single-run blobs return pointers directly into
`HP_BLOCK` records, avoiding `blob_buff` reassembly entirely:
- Case A (`run_rec_count == 1`): return `chain + HP_CONT_HEADER_SIZE`
- Case B (`HP_ROW_CONT_ZEROCOPY` flag): return `chain + recbuffer`
- Case C (multi-run): walk chain, reassemble into `blob_buff`
`HP_INFO::has_zerocopy_blobs` tracks zero-copy state; used by
`heap_update()` to refresh the caller's record buffer after freeing
old chains, preventing dangling pointers.

**Free list scavenging**: on insert, the free list is walked read-only
(peek) tracking contiguous groups in descending address order (LIFO).
Qualifying groups (>= `min_run_records`) are unlinked and used. The
first non-qualifying group terminates the scan — remaining data is
allocated from the block tail. The free list is never disturbed when
no qualifying group is found.

**Record counting**: new `HP_SHARE::total_records` tracks all physical
records (primary + continuation). `HP_SHARE::records` remains logical
(primary-only) to preserve linear hash bucket mapping correctness.

**Scan/check batch-skip**: `heap_scan()` and `heap_check_heap()` read
`run_rec_count` from rec 0 and skip entire continuation runs at once.

**Hash functions**: `hp_rec_hashnr()`, `hp_rec_key_cmp()`, `hp_key_cmp()`,
`hp_make_key()` updated to handle `HA_BLOB_PART` key segments — reading
actual blob data via pointer dereference or chain materialization.

**SQL layer**: `choose_engine()` no longer rejects HEAP for blob tables
(replaced `blob_fields` check with `reclength > HA_MAX_REC_LENGTH`).
`remove_duplicates()` routes HEAP+blob to `remove_dup_with_compare()`.
`ha_heap::remember_rnd_pos()` / `restart_rnd_next()` implemented for
DISTINCT deduplication support. Fixed undefined behavior in
`test_if_cheaper_ordering()` where `select_limit/fanout` could overflow
to infinity — capped at `HA_POS_ERROR`.

https://jira.mariadb.org/browse/MDEV-38975
montywi added 3 commits April 27, 2026 23:40
…bles

group_concat needs special code in store() for storing the result
in table->blob_storage. This was implemented in Field_blob::store()
but not in Field_blob_compressed::store(). This was temporarily solved
by not using Field_blob_compressed when table->s->is_optimizer_tmp_table()
would be set. This however disabled Field_blob_compressed for
temporary tables that did not need a key for the blob field.

Fixed by ensuring that Field_blob::store() and Field_blob_compressed::store()
handle group_concat identically. As this handling is only done for
internal temporary tables, we store the data uncompressed for faster
usage by group_concat()
The problem was that create_internal_tmp_table() tries to create a
normal key for the blob, which does not work.
The fix is to force a unique key if BLOB keys were used for the
original HEAP table.
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ arcivanov
❌ montywi

@arcivanov arcivanov force-pushed the MDEV-38975 branch 3 times, most recently from 5dde250 to 39eda4e Compare May 1, 2026 04:39
arcivanov added 2 commits May 1, 2026 02:05
Apply Monty's review feedback across HEAP blob implementation:

**`hp_update.c`:**
- Hoist `HP_BLOB_DESC *desc` to block scope, use `desc++` in all three
  blob loops instead of re-indexing `&share->blob_descs[i]` each iteration
- Move `new_len` declaration to block scope, remove inner `{ }` wrapper
  block, dedent the write-new-chains loop body by one level
- Replace `if (blob_changed[i]) any_changed= TRUE` with branchless
  `any_changed|= blob_changed[i]`
- Add braces to rollback `for (j= 0; j < i; j++)` loop body
- Update chain pointer restoration comment to explain `pos` vs `old`
  pointer semantics for segmented blobs
- Rename inner loop `new_len` to `cur_len` to avoid shadowing block-scope
  `new_len`

**`read_lowendian()` move (F33/F63):**
- Move `read_lowendian()` from `sql/field.h` to `include/my_base.h` so
  pure-C storage engines can use it
- Convert `hp_blob_length()` from standalone function in `hp_blob.c` to
  `static inline` wrapper in `heapdef.h` calling `read_lowendian()`
- Convert `hp_blob_key_length()` in `hp_hash.c` to `static inline`
  wrapper calling `read_lowendian()`

**Test renames** (Monty's naming convention — drop `heap_` prefix):
- `heap_blob.test` → `blob.test`
- `heap_blob_big{1,2,3}.test` → `blob_big{1,2,3}.test`
- `heap_blob_big.inc` → `blob_big.inc`
- `heap_blob_groupby.test` → `blob_group_by.test`
- `heap_blob_ops.test` → `blob_ops.test`

**Other files:** Style fixes from earlier feedback items applied across
`hp_blob.c`, `hp_write.c`, `hp_hash.c`, `hp_scan.c`, `hp_delete.c`,
`ha_heap.cc`, `heapdef.h`, `_check.c`, `heap.h`, `field.h`,
`sql_select.cc`, and test files.
# Conflicts:
#	sql/mysqld.cc
#	sql/sql_parse.cc
arcivanov added 5 commits May 1, 2026 02:23
`hp_alloc_from_tail()` now takes `uint *blocks` (in/out) and allocates
a contiguous batch of records from the current leaf block in one call,
replacing the per-record inner loop in `hp_write_one_blob()` Step 2.

The caller pre-computes the record count needed for the chosen storage
format (Case B for `is_only_run`, Case C otherwise), and the function
returns however many are available up to the request. The flat
if/else-if/else then selects Case A, B, or C based on the actual count.

This eliminates the record-by-record extension loop, both contiguity
guards with `abort()`, and the Case B extra-record allocation logic,
reducing Step 2 from ~170 lines to ~60.
Two fixes in `hp_write_one_blob()`:

**Bug fix**: Step 1 free-list contiguity detection failed to update
`prev_pos` inside the contiguity branch, so the check
`pos == prev_pos - recbuffer` could only detect 2-record groups.
The third record was always compared against the original `prev_pos`
(2 recbuffers away), causing a false discontinuity. Fix: add
`prev_pos = pos` after `run_start = pos`.
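The contiguity walk with the fix applied can be sketched as follows (illustrative names; the real loop lives in `hp_write_one_blob()` Step 1). Without advancing `prev_pos` inside the contiguity branch, the third record was compared against the first, two recbuffers away, so groups longer than two records were never detected.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: scan record positions in list order (descending addresses)
   and report the longest contiguous group found. */
static unsigned longest_group(uint8_t **positions, unsigned n,
                              size_t recbuffer)
{
  unsigned best = 0, run = 0;
  uint8_t *prev_pos = NULL;
  for (unsigned i = 0; i < n; i++)
  {
    uint8_t *pos = positions[i];
    if (prev_pos && pos == prev_pos - recbuffer)
    {
      run++;
      prev_pos = pos;      /* the fix: advance within the group */
    }
    else
    {
      run = 1;             /* discontinuity: start a new group */
      prev_pos = pos;
    }
    if (run > best)
      best = run;
  }
  return best;
}
```

Dropping the `prev_pos = pos` line inside the first branch reproduces the bug: `run` can never exceed 2.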

**Deficiency MariaDB#2**: When Step 2 (tail allocation) fails with
`HA_ERR_RECORD_FILE_FULL` and there are still deleted records on the
free list, a new Step 3 walks the entire free list accepting any
contiguous group (even single slots). Each group is written as a
Case C run via `hp_unlink_and_write_run()`. This produces maximally
fragmented chains, which are slower to read but correct. Failing with
table-full when free slots exist is worse than a fragmented chain.

Tests:
- `hp_test_freelist-t.c`: 38 unit tests covering contiguity detection
  (prev_pos bug guard), repeated delete-reinsert cycles, Step 3
  scavenge fallback, and true capacity exhaustion
- `heap/blob_fallback.test`: MTR test exercising the fallback at SQL
  level with fragmented free list
- Extracted shared `hp_test_helpers.h` from duplicate code in
  `hp_test_scan-t.c` and `hp_test_freelist-t.c`
`SHOW STATUS LIKE 'Created_tmp%'` counts include the extra temp table
created by prepared statement re-execution under `--ps-protocol`. The
test already disabled `cursor_protocol` and `ps2_protocol` but missed
`ps_protocol`, causing `blob_big1`/`blob_big2`/`blob_big3` to fail on
CI builders that run with `--ps-protocol`.
`heap_prepare_hp_create_info()` mishandled blob key segments whose
field type is `Field_blob` or `Field_geom` (as opposed to
`Field_blob_key`).

`Field_blob::key_type()` returns `HA_KEYTYPE_VARBINARY2` (geometry)
or `HA_KEYTYPE_VARTEXT2` (text blobs).  Commit `30415846402`
("Introduce Field_blob_key") added logic that stripped `HA_BLOB_PART`
from these segments, assuming they use "VARCHAR packing."  This is
wrong: DISTINCT/UNION key fields are `Field_blob` (not
`Field_blob_key`), and their record format is a blob descriptor
(`packlength` bytes of length + 8-byte data pointer), not a varchar
(2-byte length prefix + inline data).

After `HA_BLOB_PART` was stripped, `hp_create.c` normalized
`VARBINARY2` → `VARTEXT1`.  The hash function `hp_rec_hashnr()` then
entered the `VARTEXT1` branch, which reads the first 2 bytes as a
varchar length and hashes that many bytes starting at offset 2.  For a
geometry blob descriptor, the first 2 bytes are the low bytes of the
WKB data length (e.g. ~100 for a simple polygon), so the hash read
~100 bytes starting inside the 12-byte descriptor — overshooting into
adjacent fields or uninitialized record buffer memory.

This caused MSAN "use-of-uninitialized-value" crashes in
`innodb_gis.1`, `innodb_gis.point_basic`, `main.gis`, and
`innodb_gis.gis` on the `amd64-msan-clang-20` CI builder.  Beyond the
MSAN crash, it was also a functional bug: hashing the raw pointer
bytes meant two rows with identical geometry data but different memory
addresses would hash differently, breaking UNION DISTINCT
deduplication.

**Fix**: when `HA_BLOB_PART` is set and the key type is
`VARBINARY2`/`VARTEXT2`, promote to `VARBINARY4`/`VARTEXT4` instead
of stripping the flag.  Set `bit_start` to the actual `packlength`
and `length` to `4 + sizeof(pointer)`.  `hp_create.c` then normalizes
to `VARTEXT4` and the hash/compare functions use the blob path:
dereference the pointer and operate on the actual data.
3-phase `heap/blob_stress` MTR test exercising free-list
fragmentation, continuation chain reuse, and data integrity under
sustained mixed DML:

- Phase 1: 200-cycle stored procedure — 5 inserts, 2 deletes,
  3 updates per cycle with blob sizes cycling through Case A/B/C;
  shadow table row count verification with `SIGNAL` on mismatch
- Phase 2: near-capacity (2 MB) fill/fragment/refill with free-list
  scavenging, then full-delete reinsert
- Phase 3: 20 grow/shrink `UPDATE` cycles (even rows 10-18 KB,
  odd rows 5-20 bytes)

All phases verify blob content integrity via single-character
`REPEAT` pattern check. Addresses Monty's F14 feedback.
# Conflicts:
#	mysql-test/main/metadata.result
`Field_geom::store()`: assert `blob_storage` is not set, catching
any future removal of the MDEV-16699 `group_concat` downgrade in
`Field_blob::make_new_field()`.

`heap_prepare_hp_create_info()`: after `HA_BLOB_PART` promotion,
assert key type is `VARTEXT4`/`VARBINARY4` and `bit_start` is 1-4,
catching blob key segments that were not promoted from `VARTEXT2`.
@arcivanov arcivanov force-pushed the MDEV-38975 branch 2 times, most recently from d61b485 to 0d6a069 Compare May 1, 2026 20:19
- `Field_blob::get_key_image_itRAW` and `Field_blob::key_cmp`: cast
  `local_char_length` to `uint32` in `set_if_smaller()` — safe because
  `charpos()` is bounded by `blob_length` which is already `uint32`.
- `hp_test_hash-t.c`: widen `blob_len` parameter from `uint16` to
  `size_t` in `build_record()` and `build_mixed_record()` to match
  `LEX_CUSTRING::length` type.
@arcivanov arcivanov force-pushed the MDEV-38975 branch 3 times, most recently from 47b989d to 6e5e5d3 Compare May 2, 2026 00:39
arcivanov added 2 commits May 1, 2026 20:54
When `ha_update_tmp_row()` fails with `HA_ERR_RECORD_FILE_FULL` on a
HEAP temp table (e.g., `MAX(TEXT)` aggregate growing the blob during
GROUP BY accumulation), convert the table to Aria and retry the update
— matching the existing INSERT overflow handling.

**Mechanism**: `create_internal_tmp_table_from_heap()` copies all rows;
`record[0]` write is rejected as duplicate (same GROUP BY key).
`get_dup_key()` populates `dup_ref`, then `ha_rnd_pos()` locates the
old row in Aria for the update.

**Two call sites fixed**:
- `end_update()`: switches to `end_unique_update` after conversion
- `end_unique_update()`: restores INDEX mode via `rnd_inited` flag

`GROUP_CONCAT` does NOT trigger this path — its `update_field()` is
`DBUG_ASSERT(0)` (it accumulates internally). `MAX(TEXT)` / `MIN(TEXT)`
are the aggregates that write growing blobs via `result_field->store()`.
When `hp_write_one_blob()` fails partway through tail allocation
(e.g. `HA_ERR_RECORD_FILE_FULL`), `hp_free_run_chain()` puts the
partial chain onto the delete list. These records were just
tail-allocated but once on the delete list they could only be reused
via free-list scavenging, not tail allocation, so `last_allocated`
only grew forward.

Add `hp_shrink_tail()` which pops tail-positioned records from the
delete list head and decrements `last_allocated`. Crosses block
boundaries by locating the previous leaf block via `hp_find_block()`
and updating `last_blocks`. Empty blocks stay allocated in the tree
(freed at table drop).

Add `high_water_allocated` to `HP_BLOCK` to track the peak
`last_allocated` before shrinking. In `hp_alloc_from_tail()`, when
`block_pos == 0` and `last_allocated < high_water_allocated`, the
next leaf block already exists in the tree: reuse it via
`hp_find_block()` instead of calling `hp_get_new_block()`, avoiding
memory waste from duplicate block allocations.
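The tail-shrink idea above can be sketched as follows. Records freed right after a failed tail allocation sit at the head of the LIFO delete list, so while the head is exactly the last allocated slot it can be popped and the tail moved back. Names are illustrative, and the block-boundary handling of the real `hp_shrink_tail()` is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

struct del_node { struct del_node *next; };

struct tail_state {
  struct del_node *del_list;   /* LIFO delete list head */
  uint8_t *tail;               /* one past the last allocated record */
  size_t recbuffer;
};

/* Pop tail-positioned records off the delete list and return them to
   the tail allocator; returns the number of records reclaimed. */
static unsigned shrink_tail(struct tail_state *s)
{
  unsigned reclaimed = 0;
  while (s->del_list &&
         (uint8_t *) s->del_list == s->tail - s->recbuffer)
  {
    s->tail -= s->recbuffer;           /* give the slot back to the tail */
    s->del_list = s->del_list->next;   /* pop from the delete list */
    reclaimed++;
  }
  return reclaimed;
}
```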

Unit tests (41 new assertions in `hp_test_freelist-t.c`):
- Single-block tail reclaim after failed blob insert
- Cross-block reclaim (2 blocks, 1 boundary crossing)
- 3-block reclaim (2 boundary crossings, `last_blocks` restoration)
- Orphaned block reuse: non-blob and blob inserts fill reclaimed
  blocks without growing `data_length`
arcivanov added 2 commits May 1, 2026 23:45
For blob tables, `find_unique_row()` previously materialized blobs
twice: once during `hp_rec_key_cmp()` (per-segment via
`hp_materialize_one_blob()`) and again via `hp_read_blobs()` after
the match was found.

Reorder the blob path to materialize-then-compare: save the input
record, copy the stored candidate, call `hp_read_blobs()` once to
materialize all blobs, then compare via `hp_rec_key_cmp()` with
`info=NULL` since both records now have direct data pointers.

Non-blob tables keep the original fast path unchanged.
Remove `enum hp_blob_format` and `hp_blob_run_format()` indirection.
Add `HP_ROW_MULTIPLE_REC` (bit 5) so all three blob storage formats
have a dedicated flag bit. Add named inline predicates
`hp_is_single_rec()`, `hp_is_zerocopy()`, `hp_is_multi_run()` matching
the existing `hp_is_active()`/`hp_has_cont()`/`hp_is_cont()` pattern.

Change `hp_write_run_data()` format parameter from enum to `uchar`
receiving bit constants directly; simplify flags byte assignment from
ternary to bitwise OR.

Addresses review feedback F127-F128, F130, F132-F134.

Labels

- External Contribution: All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements.
- Rework: pull requests that are currently being re-implemented by the MariaDB developers.
